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Abstract. This paper describes a new accumulate-and-add multiplication algorithm. 
The method partitions one of the operands and re-combines the results of compu- 
tations done with each of the partitions. The resulting design turns-out to be both 
compact and fast. 

When the operands' bit-length m is 1024, the new algorithm requires only 0.194m + 56 
additions (on average), this is about half the number of additions required by the 
classical accumulate-and-add multiplication algorithm (^). 



1 Introduction 

Binary multiplication is one of the most fundamental operations in digital electron- 
ics. Multiplication complexity is usually measured by bit additions, assumed to have 
a unitary cost. 

Consider the task of multiplying two m-bit numbers A and B by repeated accu- 
mulations and additions. If A and B are chosen randomly (i.e. of expected Hamming 
weight w = m/2) their classical multiplication is expected to require w(B) = m/2 
additions of A. 

The goal of this work is to decrease this work-factor by splitting B and batch- 
processing its parts. The proposed algorithm is similar in spirit to common-multiplicand 
multiplication (cmm) techniques [1], [2], [3], [4]. 



* most of the work has been done while this author was working at the Max-Planck Institut fur 
Mathematik (mpim Bonn, Germany) 



B 2 = 101010 Bi = 100011 



Fig. 1. Venn diagram of characteristic vectors 



2 Proposed Multiplication Strategy 

We first extend the exponent-folding technique [5], suggested for exponentiation, 
to multiplication. A similar approach has been tried in [3] to fold the multiplier 
into halves. Here we provide an efficient and generalized operand decomposition 
technique, consisting in a memory-efficient multiplier partitioning method and a 
fast combination method. For the sake of clarity, let us illustrate the method with 
a toy example. As the multiplicand A is irrelevant in estimating the work-factor (A 
only contributes a multiplicative constant), A will be omitted. 



2.1 A Toy Example 

Let m = 2 • n and B = 101010100011 2 = B 2 \\B 1 = 

For i, j € {0, 1}, set B^ := {S5S4S3S2S1S0} with s r = 1 iff b% = i and b\ = j. 
That is, is the characteristic vector of the column (ij) T in the 2 by | array 
formed by B2 and B\ in parallel. Hence, 



B m = 010100, B {01) = 000001, B {10) = 001000, B {11) = 100010. 

Note that all of -B(oo)i ^(oi)* ^(10) j an d ^(11) are bitwise mutually exclusive, or 
disjoint. All these characteristic vectors except -B(oo) can be visualized in a natural 
way as a Venn diagram (see Fig. 1). Hence, B\ and B2 can be represented as 

B l = ^ B (il) = B (Q1) + ^(11), B 2 = XT B ( 1 3) = B ( 10 ) + B ( 11 )- 

ie{o,i} je{o,i} 

Now, the multiplication of A by B can be parallelized essentially by multiplying 
A by -£>(oi)> -S(io)' an d -^(11)! the final assembly of the results of these multiplications 
requires a few additions and shifts. Namely, 



A x B = A x (2 n • B 2 + Bi) = 2 n (A x B 2 ) + A x ^ 
= 2™(A x B (10) + A x + A x 5(01) + A x 

where 2™ • z can be performed by an n-bit left shift of z. 

All these procedures are summarized in Algorithm 1. Note that Algorithm 1 
eliminates the need of storage for characteristic vectors by combining the parti- 
tioning into characteristic vectors and the parallel evaluation of several A x B^ 
computations. 

Accumulate-and-add multiplication by operand-folding in half 

Input: m-bit integers A and B = i^H-Bi, where Bi = (b t n _ 1 ■ ■ ■ b\b l ) and n = m/2 
Output: C = A x B 

1 C(oi) <~ C(io) ^- C(ii) ^- 
2-1 for i = to n — 1 do 
2-2 if (ft^i)/ (00) 

2-3 ^(bf&l) ^ C (bpj) + ^ 

2- 4 A <- A < 1 

3- 1 C(io) ^~ C(io) + C(n) 
3-2 C( i) <- C(oi) + C(n) 

4 C <- (C ( io) < n) + C ( oi) 

Suppose that both A and B are m-bit integers and each Bi is an y-bit inte- 
ger. On average, the Hamming weights of Bi and Buj\ are ^ and ^, respectively. 
For evaluating Ax B, Algorithm 1 requires + 3 additions without taking into 
account shift operations into account. Hence, performance improvement over clas- 
sical accumulate-and-add multiplication is 3 ™/g +3 ~ |. In exchange, Algorithm 1 
requires three additional temporary variables. 

2.2 Generalized Operand Decomposition 

Let B be an m-bit multiplier having the binary representation (6 m _i • • • &160), i.e., 
B = Y^1=(^ bi2 t where bi € {0, 1}. By decomposing B into k parts, B is split into 
k equal-sized substrings as B = Bk\\ • • • \\B2WB1, where each Bi, represented as 
(^71-1 ' ' ' is n = fxl"' 3 ^ 8 l° n g- If m i s not a multiple of k, then B^ is left- 

padded with zeros to form an n-bit string. Hence, 

k 

ixB = ^2"( , - 1 »(Ax5 i ). (1) 



By Horner's rule, equation 1 can be rewritten as 



AxB = 2 n (2 n (- • • (2 n (A xB k ) + Ax B k _{) ■ ■ ■) + A x B 2 ) + A x B x . (2) 

The problem is now reduced into the effective evaluation of the {A x Bi\i = 
1, 2, . . . , k; k > 2} in advance, which is known as the common-multiplicand multi- 
plication (cmm) problem. For example [1,2,4] dealt with the case k = 2, and [3] 
dealt with the case k = 3 or possibly more. In this work we present a more general 
and efficient CMM method. 

As in the toy example above, the first step is the generation of 2 k disjoint char- 
acteristic vectors Bu^..^ from the k decomposed multipliers Bi. Each Bt ik ..^ is n 
bits long and of average Hamming weight n/2 k . Note that, as in Algorithm 1, no 
additional storage for the characteristic vectors themselves is needed in the parallel 
computation of the A x B^ ik ... il ^s. 

The next step is the restoration of A x Bj for 1 < j < k using the evaluated 
values C(j fc ...jj) = A x B^ ik ... il y The decremental combination method proposed in 
[6] makes this step more efficient than other methods used in CMM. For notational 
convenience, C( ...oi J —i 1 ) can simply be denoted as C^....^ by omission of zero runs 
on its left side, and C^ k ... il ^ can be denoted as where (i k ■■■ii) is the binary 
representation of a non-negative integer i. Then A x Bj for j = k, . . . , 1 can be 
computed by 

A x B i = Yl Qib-i-u)' 

(ij-i-h) 

Cfe-i-ii) = Cfe-i-ii) + C^-i-ii), V(ij_i ■■■h). 

Figure 2 shows the combination process for a case k = 3 with Venn diagrams. 

The last step is the application of Horner's rule on the results obtained from 
the above step. The overall procedure to compute A x B is given in Algorithm 2. 
Note that Algorithm 2 saves memory by recycling space for evaluated characteristic 
vectors, without use of temporary variables for A x B{. 

Accumulate-and-add multiplication by generalized operand decomposition 

Input: m-bit integers A and B = B k \\ ■ ■ ■ \\B±, where Bi = (b' l n _ 1 ■ ■ ■ b\b l ) and n = [ 
Output: C = AxB 

1 C (ik ... h) ^0 foraJl(i fc ---ii)^(0---0) 

2-1 for i = to n — 1 do 

2-2 if (b k ---bj) + (0---0) 

2-3 C {h k ... b i } <- C (6 fc... 6 i) + ^4 

2-4 A ^4 < 1 



A x B 3 — C( 100 ) + C (101 ) 




AxB 2 = C (10) + C (11) AxBi = C (1) 



Fig. 2. Venn diagram representation for combination process when k — 3 

3-1 for z = k down to 1 do 
3-2 for j = 1 to 2* 1 - 1 do 

3-3 C(2*- 1 ) C(2 i - 1 ) + C(2 i - 1 +j) {C(2 i - 1 ) corresponds to A x 

3- 4 ^ C (j) + C* (2 <-i +j) 

4- 1 C ^ — C^fc-i) 

4-2 for i = k — 1 down to 1 do 
4-3 C 4- C < n 

4-4 C ^ C + C (2 i-i) 



3 Theoretical Asymptotic Analysis 

It is interesting to determine how the actual number of additions necessary to per- 
form a multiplication decreases as parallelization increases. Neglecting the additions 
required to recombine the parallelized results, the number of additions tends to zero 
as the degree of parallelism k increases. The convergence is slow, namely: 

log k log log m 
k logm 

since k < log m is required to avoid edge effects. In practice if the operand is split 
into an exponential number of sub-blocks (actually 3 fc ) the total Hamming weight 
of the blocks will converge to zero. 

To understand why things are so, we introduce the following tools: 

Let 5q <G [0, tj] and 5i+\ = <5j(l — <5j) then 



lim Si = 



More precisely, Si = and 



n-l 

^ <5f = So - S n => ^2 5 i = 5 o 

i=0 i=0 

Let B have length 6 and density Si, i.e. weight Sib. After performing the split- 
ting process, we get three blocks, -B( 10 ), -B(oi) an d -S(n) of length | and respective 
densities <5j+i = Si(l — Si) for the first two and Sf for 5(h)- The total cost of a 
multiplication is now reduced from Sib to 

c , 

In other words, the gain of this basic operation is nothing but the Hamming 
weight of -B(n) : 

Sjb 
2 

Graphically, the operation can be regarded as a tree with root B, two nodes 
-B( 10 ), -E>(oi) and a leaf -B(n). The gain is the Hamming weight of the leaf. 

We will now show that by iterating this process an infinity of times, the total 
gain will converge to the Hamming weight of B. 



3.1 First Recursive Iteration of the Splitting Process 

Apply the splitting repeatedly to the nodes: this gives a binary tree having two 
nodes and one leaf at level one, and more generally 2 J nodes and 2 J_1 leaves at level 
j. The gain jij of this process is the sum of the weights of the A r i J - = 2 J ' — 1 leaves, 
that is: 



i=0 

As j increases we get an infinite tree A\ , a gain of 

bS 



71 2 



and a total weight of 



3.2 Second Recursive Iteration of the Splitting Process 

We now apply the previous recursive iteration simultaneously (in parallel) to all 
leaves. Note that each leaf from the previous step thereby gives rise to 1 + 2 + . . . + 
2 s + . . . new leafs. In other words, neglecting edge effects we have N2J ~ ^1 j- 

The last step consists in iterating the splitting process i times and letting i tend 
to infinity. By analogy to the calculations of the previous section the outcome is an 
extra gain of: 

72 = W 2 = — 
Considering Wt and letting t — > 00, we get a total gain of: 

r = J2ii = 2w i = b5 

i 

Thus a non-intuitive phenomenon occurs: 

— Although Nij m Nlj, eventually the complete ternary tree T is covered, hence 
there are no pending leaves. 

— The sum of an exponential number of weights (3 fc with k — > 00) tends to zero. 

3.3 Speed of Convergence 

The influence of truncation to a level k < log n is twofold: 

— The recursive iterations Ri are limited to i = k, thus limiting the number of 
additional gains 7, to 7&. 

— Each splitting process is itself limited to level k, thus limiting each additional 
gain 7i, 1 < i < k to 7^. 

Let us estimate these two effects: 

k , 

k<\ogn- loglogn => T fc = V < <5 (1 ) 

i=i 

k 

k > log n - log log n =>= ^ 7i ~ 7i,fe > ( lo S n ~ lo S lo S n ) min(7i - 7^) 

i=i 

But 

min(7i - 7^) » — (1 - o(l)) 
Hence the global weight tends to zero like 0(^f^)- 



Table 1. Optimal k for F as a function of m 



Optimal k 


Range of m 


f Favg(m,fc)/Favg(m,l) 


■mF wst (m, k)/F wst (m, 1) 


2 


24 < m < 83 


0.375m + 3 


0.500m + 3 


3 


84 < m < 261 


0.292m + 10 


0.333m + 10 


4 


262 < m < 763 


0.234m + 25 


0.250m + 25 


5 


764 < m < 2122 


0.194m + 56 


0.200m + 56 



4 Performance Analysis and Comparison 

Accumulate-and-add multiplication performance is proportional to the number of 
additions required. Hence, we analyze the performance of the proposed multiplica- 
tion algorithm. 

In step 2, as the average Hamming weight of each characteristic vector is n/2 k , 
where n = \m/k~\, the number of additions needed to multiply A by 2 k — 1 disjoint 
characteristic vectors in parallel is (2 k — 1) • ^ on average. In step 3, the computation 
of every A x Bi by combination of the evaluated characteristic vectors requires the 
following number of additions: 

k k 

2(2 i - 1 - 1) = J^(2 i - 2) = 2 k+1 -2k -2, 
i=i i=i 

whereas the method used in [3] requires k{2 k ~ 1 — 1) additions. In step 4, the 
completion of A x B using Horner's rule requires k — 1 additions. Therefore, the 
total number of additions needed to perform the proposed algorithm is on average 
equal to: 



n , 2 K -1 
F avg (m,fc) = — ^ 



+ 2 k+1 - k - 3. 



"m" 
~k 

On the other hand, F wst (m, k) = |~^] + 2 k+1 - k - 3 in the worst case. 

Performance improvement over the classical accumulate-and-add multiplication 
algorithm is asymptotically: 

lim w) = lim ^11 = t^L 

rr^ooF mg {m,k) m^oo 2^1 . Tmj + 2 fc+l _ - 3 2 fe - 1 

Larger k values do not necessarily guarantee the better performance, because 
the term 2 k+l — k — 3 increases exponentially with k. Thus, a careful choice of k is 
required. The analysis of -F avg for usual multiplier sizes m yields optimal k values 
that minimize F avg . The optimal k values as a function of m are given in Table 1. 



Table 1 also includes comparisons with the classical algorithm for the both the case 
and the worst cases. 

In modern public key cryptosystems, m is commonly chosen between 1024 and 
2048. This corresponds to the optimum k = 5 i.e. an 2.011 to 2.260 performance 
improvement over the classical algorithm and 1.340 to 1.560 improvement over the 
canonical signed digit multiplication algorithm [7] where the minimal Hamming 
weight of is y on the average. 

On the other hand, the proposed algorithm requires storing 2 k — 1 temporary 
variables, which correspond to 0((2 k — l)(m + n + fc))-bit memory. Whenever k > 3, 
although optimal performance is not guaranteed, the new algorithm is still faster 
than both classical and canonical multiplication. 
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A Hardware Implementation 



LIBRARY IEEE; USE ieee . std_logic_1164. all ; USE leee . std_logic_unsigned. all ; 

ENTITY Mult_Entity IS 

GENERIC (CONSTANT m : NATURAL := 32; 

CONSTANT k : NATURAL := 2); 
PORT (A : in STD_LOGIC_VECTOR (m-1 DOWNTO 0); 
B : in STD_LOGIC_VECTOR (m-1 DOWNTO 0) ; 
C : out STD_L0GIC_VECT0R(2*m-l DOWNTO 0)); 

END Mult_Entity; 

ARCHITECTURE Behavioral OF Mult_Entity IS 
SIGNAL n : NATURAL := m+k-l/k; 
SIGNAL INPUT.LENGTH : NATURAL := n*k; 
SIGNAL OUTPUT_LENGTH : NATURAL := 2*INPUT_LENGTH; 
SIGNAL C_TEMP : STD_L0GIC_VECT0R(2*INPUT_LENGTH-1 DOWNTO 0) ; 
SIGNAL C_PARTS_LENGTH : NATURAL := INPUT_LENGTH+n ; 
SIGNAL A_TEMP : STD_L0GIC_VECT0R(C_PARTS_LENGTH-1 DOWNTO 0) ; 
SIGNAL B_value : INTEGER; 

TYPE BX_TYPE IS ARRAY (k DOWNTO 1) OF STD_LOGIC_VECTOR(n-l DOWNTO 0) ; 
SIGNAL BX : BX_TYPE; 

SIGNAL cx_count : NATURAL := 2**k-l; 

TYPE CX_TYPE IS ARRAY (cx_count DOWNTO 1) OF STD_L0GIC_VECT0R(C_PARTS_LENGTH-1 DOWNTO 0) ; 
SIGNAL CX : CX.TYPE; 

BEGIN 

Myproc : PROCESS ( A, B) 

VARIABLE i, j : INTEGER := 0; 
BEGIN 

FOR i IN 1 TO k-1 LOOP BX(i)(n-l DOWNTO 0) <= B(i*n-1 DOWNTO (i-l)*n); END LOOP; 

BX(k) (m-(n*(k-l))-l DOWNTO 0) <= B(m-1 DOWNTO m-n* (k-1) ) ; 
IF ((m MOD k)>0) THEN BX(k)((n-l) DOWNTO (n-l-(m MOD k))) <= "0"; END IF; 
A_TEMP (m-1 DOWNTO 0) <= A; A_TEMP (C_PARTS_LENGTH- 1 DOWNTO m) <= "0"; 

—STEP 1 

For i IN 1 TO 2**k-l LOOP CX(i) <= "0"; END LOOP; 
—STEP 2-1 

For i IN TO n-1 LOOP 

B.value <= 0; 

FOR j IN 1 TO k LOOP 

IF ((BX(j) (i))='l>) THEN B_value <= B_value + 2**(j-l); END IF; 

END LOOP; 
—STEPS 2-2 and 2-3 

IF (B_value>0) THEN CX (B.value) <= CX (B.value) + A_TEMP; END IF; 
—STEP 2-4 

A_TEMP <= A_TEMP (C_PARTS_LENGTH-2 DOWNTO 0)&"0"; 
END LOOP; 

—STEP 3-1 

FOR i IN k DOWNTO 1 LOOP 
—STEP 3-2 

FOR j IN 1 TO 2**(i-l)-l LOOP 
—STEP 3-3 

CX(2**(i-l)) <= (CX(2**(i-l)) + CX(2**(i-l)+j)) ; 



—STEP 3-4 

CX(j) <= (CX(j) + CX(2**(i-l)+j)); 
END LOOP; 
END LOOP; 
—STEP 4-1 

C_TEMP (C_PARTS_LENGTH- 1 DOWNTO 0) <= CX(2** (k-1) ) ; 
C_TEMP (n-1 DOWNTO C_PARTS_LENGTH- 1 ) <= "0" ; 
—STEP 4-2 

FOR i IN k-1 DOWNTO 1 LOOP 
—STEP 4-3 

C_TEMP <= C_TEMP(2*m-l-n DOWNTO 0) & "0" ; 
—STEP 4-4 

C_TEMP <= C_TEMP + CX(2** (i-1) ) ; 
END LOOP; 

END PROCESS Myproc; 
C <= C_TEMP; 
END Behavioral; 



